18  Probability

Definition 18.1 The probability space (\Omega, \mathcal{F}, \mathbb{P}) of a random experiment consists of the following components.

  • The sample space \Omega is the set of all possible outcomes of the random experiment.

  • The set of events \mathcal{F}, where each event A is a subset of the sample space \Omega, and we say that A occurs if the outcome \omega \in \Omega of the random experiment is an element of A. By definition, \mathcal{F} is a \sigma-algebra, which means it must satisfy the following requirements.

    • Contains the sample space: \Omega \in \mathcal{F}.

    • Closed under complement: if A \in \mathcal{F}, then \Omega \setminus A \in \mathcal{F}.

    • Closed under countable unions: if A_{1}, A_{2}, \dots \in \mathcal{F}, then \bigcup_{i = 1}^{\infty} A_{i} \in \mathcal{F}.

  • A probability measure \mathbb{P}: \mathcal{F} \to [0, 1] is a function that assigns probabilities to the events in \mathcal{F} and satisfies the following axioms.

    • Non-negativity: \mathbb{P} (A) \geq 0, \forall A \in \mathcal{F}.

    • Normalization: \mathbb{P} (\Omega) = 1.

    • Countable additivity: if A_{1}, A_{2}, \dots are mutually disjoint, then \mathbb{P} \left( \bigcup_{i = 1}^{\infty} A_{i} \right) = \sum_{i = 1}^{\infty} \mathbb{P} (A_{i}).

If the sample space \Omega is finite or countably infinite, the probability space (\Omega, \mathcal{F}, \mathbb{P}) defined in Definition 18.1 is said to be discrete.

If the sample space \Omega is uncountably infinite, the probability space (\Omega, \mathcal{B}, \mathbb{P}) is said to be continuous if the set of events \mathcal{B} is a Borel \sigma-algebra. If the sample space is the real line \mathbb{R}, the Borel \sigma-algebra is the smallest \sigma-algebra that contains all open sub-intervals of \Omega = \mathbb{R}.
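As a quick sanity check, the following minimal Python sketch (all names are illustrative, not from the text) builds a discrete probability space for a fair six-sided die, using the power set of \Omega as \mathcal{F} and the uniform measure, and verifies the axioms on it.

```python
from itertools import combinations
from fractions import Fraction

# Illustrative sketch: a fair six-sided die as a discrete probability space.
omega = frozenset(range(1, 7))  # sample space

# For a finite sample space, the power set of omega is a valid sigma-algebra.
F = [frozenset(c) for r in range(len(omega) + 1)
     for c in combinations(omega, r)]

def P(event):
    """Uniform probability measure: |A| / |Omega|."""
    return Fraction(len(event), len(omega))

assert all(P(A) >= 0 for A in F)    # non-negativity
assert P(omega) == 1                # normalization
assert all(P(A | B) == P(A) + P(B)  # additivity for disjoint events
           for A in F for B in F if not (A & B))
```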

Basic probability properties

Corollary 18.1 The probability of the union of two events A and B is

\mathbb{P} (A \cup B) = \mathbb{P} (A) + \mathbb{P} (B) - \mathbb{P} (A \cap B).

Proof

Since A \cup B can be partitioned into the 3 disjoint sets A \setminus B, B \setminus A, and A \cap B, we can use the countable additivity axiom in Definition 18.1 to show the following

\mathbb{P} (A) = \mathbb{P} (A \setminus B) + \mathbb{P} (A \cap B) \\ \mathbb{P} (B) = \mathbb{P} (B \setminus A) + \mathbb{P} (A \cap B) \\ \mathbb{P} (A \cup B) = \mathbb{P} (A \setminus B) + \mathbb{P} (B \setminus A) + \mathbb{P} (A \cap B).

Therefore, we have

\begin{aligned} \mathbb{P} (A) + \mathbb{P} (B) - \mathbb{P} (A \cap B) & = \mathbb{P} (A \setminus B) + \mathbb{P} (B \setminus A) + \mathbb{P} (A \cap B) \\ & = \mathbb{P} (A \cup B). \end{aligned}
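As a concrete instance (an illustrative example, not from the text), take a fair die with A = \{2, 4, 6\} and B = \{1, 2, 3\}; the identity can be checked exactly:

```python
from fractions import Fraction

# Illustrative check of Corollary 18.1 on a fair die:
# A = "roll is even", B = "roll is at most 3".
omega = frozenset(range(1, 7))

def P(event):
    return Fraction(len(event), len(omega))

A = frozenset({2, 4, 6})
B = frozenset({1, 2, 3})

# 5/6 == 1/2 + 1/2 - 1/6
assert P(A | B) == P(A) + P(B) - P(A & B)
```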

Corollary 18.2 (Union bound) The probability of an event that is a union of a finite set of events A_{1}, \dots, A_{n} is no greater than the sum of the probabilities of all events.

\mathbb{P} \left( \bigcup_{i = 1}^{n} A_{i} \right) \leq \sum_{i = 1}^{n} \mathbb{P} (A_{i}).

Proof

According to Corollary 18.1,

\begin{aligned} \mathbb{P} (A \cup B) & = \mathbb{P} (A) + \mathbb{P} (B) - \mathbb{P} (A \cap B) \\ & \leq \mathbb{P} (A) + \mathbb{P} (B) & [\mathbb{P} (A \cap B) \geq 0]. \end{aligned}

Therefore, we can extend the above fact for two events A and B to A_{1}, \dots, A_{n} by induction to derive the union bound.
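A quick randomized sanity check of the bound (an illustrative sketch; the events are generated at random) for n = 3 events on the same die:

```python
import random
from fractions import Fraction

# Illustrative check of the union bound for n = 3 random events on a fair die.
omega = list(range(1, 7))

def P(event):
    return Fraction(len(event), len(omega))

random.seed(0)
for _ in range(1000):
    events = [frozenset(random.sample(omega, random.randint(0, 6)))
              for _ in range(3)]
    union = frozenset().union(*events)
    assert P(union) <= sum(P(A) for A in events)
```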

Joint probability

We usually refer to the probability that the events A and B occur at the same time as the joint probability of A and B

\mathbb{P} (A \cap B) = \mathbb{P} (A, B).

Conditional probability

Definition 18.2 (Conditional probability) Let B be an event such that \mathbb{P} (B) > 0. The conditional probability of the event A given B is defined to be

\mathbb{P} (A \mid B) = \frac{ \mathbb{P} (A \cap B) }{ \mathbb{P} (B) }.

Note that \mathbb{P} (\cdot \mid \cdot ) is also a probability measure and therefore satisfies the axioms of the probability measure in Definition 18.1.
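For instance (an illustrative example, not from the text), on a fair die with A = \{2, 4, 6\} and B = \{1, 2, 3\}, the definition gives \mathbb{P} (A \mid B) = 1/3:

```python
from fractions import Fraction

# Illustrative example of Definition 18.2 on a fair die:
# P(roll is even | roll is at most 3).
omega = frozenset(range(1, 7))

def P(event):
    return Fraction(len(event), len(omega))

A = frozenset({2, 4, 6})  # even
B = frozenset({1, 2, 3})  # at most 3

# Among {1, 2, 3}, only {2} is even, so the conditional probability is 1/3.
assert P(A & B) / P(B) == Fraction(1, 3)
```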

Corollary 18.3 (Chain rule) Given a set of events A_{1}, \dots, A_{n}, the probability of their intersection can be calculated as

\mathbb{P} \left( \bigcap_{i = 1}^{n} A_{i} \right) = \prod_{j = 1}^{n} \mathbb{P} \left( A_{j} \;\middle|\; \bigcap_{k = j + 1}^{n} A_{k} \right)

where, by convention, the intersection over an empty index set is \Omega, so the last factor is \mathbb{P} \left( A_{n} \;\middle|\; \bigcap_{k = n + 1}^{n} A_{k} \right) = \mathbb{P} (A_{n}).

Proof

For the intersection of two events A and B, we have

\mathbb{P} (A \cap B) = \mathbb{P} (A \mid B) \mathbb{P} (B)

by Definition 18.2.

To apply this to multiple events, we can treat A = A_{1} and B = \bigcap_{i = 2}^{n} A_{i} and expand recursively to get

\begin{aligned} \mathbb{P} \left( \bigcap_{i = 1}^{n} A_{i} \right) & = \mathbb{P} \left( A_{1} \;\middle|\; \bigcap_{i = 2}^{n} A_{i} \right) \mathbb{P} \left( \bigcap_{i = 2}^{n} A_{i} \right) \\ & = \mathbb{P} \left( A_{1} \;\middle|\; \bigcap_{i = 2}^{n} A_{i} \right) \mathbb{P} \left( A_{2} \;\middle|\; \bigcap_{i = 3}^{n} A_{i} \right) \mathbb{P} \left( \bigcap_{i = 3}^{n} A_{i} \right) \\ & = \mathbb{P} \left( A_{1} \;\middle|\; \bigcap_{i = 2}^{n} A_{i} \right) \mathbb{P} \left( A_{2} \;\middle|\; \bigcap_{i = 3}^{n} A_{i} \right) \dots \mathbb{P} (A_{n}) \\ & = \prod_{j = 1}^{n} \mathbb{P} \left( A_{j} \;\middle|\; \bigcap_{k = j + 1}^{n} A_{k} \right). \end{aligned}
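The identity can be checked exhaustively on a small space; the sketch below (illustrative, with hypothetical events) verifies the n = 3 case on the fair die:

```python
from fractions import Fraction

# Illustrative check of the chain rule (Corollary 18.3) with n = 3.
omega = frozenset(range(1, 7))

def P(event):
    return Fraction(len(event), len(omega))

def P_given(A, B):
    """Conditional probability P(A | B), assuming P(B) > 0."""
    return P(A & B) / P(B)

A1 = frozenset({1, 2, 3, 4})
A2 = frozenset({2, 3, 4, 5})
A3 = frozenset({3, 4, 5, 6})

lhs = P(A1 & A2 & A3)
rhs = P_given(A1, A2 & A3) * P_given(A2, A3) * P(A3)
assert lhs == rhs == Fraction(1, 3)
```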

Theorem 18.1 (Bayes’s theorem) Let A, B be events with nonzero probability. Then,

\mathbb{P} (A \mid B) = \frac{ \mathbb{P} (B \mid A) \mathbb{P} (A) }{ \mathbb{P} (B) }.

Proof

This follows directly from Definition 18.2

\mathbb{P} (A \mid B) = \frac{ \mathbb{P} (A \cap B) }{ \mathbb{P} (B) } = \frac{ \mathbb{P} (B \cap A) }{ \mathbb{P} (B) } = \frac{ \mathbb{P} (B \mid A) \mathbb{P} (A) }{ \mathbb{P} (B) }.
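A direct check on the die (illustrative events, not from the text):

```python
from fractions import Fraction

# Illustrative check of Bayes' theorem on a fair die:
# A = "roll is even", B = "roll is at least 4".
omega = frozenset(range(1, 7))

def P(event):
    return Fraction(len(event), len(omega))

def P_given(A, B):
    return P(A & B) / P(B)

A = frozenset({2, 4, 6})
B = frozenset({4, 5, 6})

# Both sides equal 2/3.
assert P_given(A, B) == P_given(B, A) * P(A) / P(B)
```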

Marginal probability

Theorem 18.2 (Law of total probability) Let A_{1}, \dots, A_{n} be events that partition \Omega, that is, A_{1}, \dots, A_{n} are disjoint and their union is \Omega. Then for any event B,

\mathbb{P} (B) = \sum_{i = 1}^{n} \mathbb{P} (B \cap A_{i}) = \sum_{i = 1}^{n} \mathbb{P} (B \mid A_{i}) \mathbb{P} (A_{i}).

Proof

Since A_{1}, \dots, A_{n} partition \Omega,

\begin{aligned} B & = B \cap \Omega & [B \subseteq \Omega] \\ & = B \cap \left( \bigcup_{i = 1}^{n} A_{i} \right) \\ & = \bigcup_{i = 1}^{n} \left( B \cap A_{i} \right). \end{aligned}

By the axiom of countable additivity in Definition 18.1 and Definition 18.2, we have

\mathbb{P} (B) = \sum_{i = 1}^{n} \mathbb{P} (B \cap A_{i}) = \sum_{i = 1}^{n} \mathbb{P} (B \mid A_{i}) \mathbb{P} (A_{i}).
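The sketch below (illustrative, with a hypothetical partition) checks both equalities on the fair die, partitioned into \{1, 2\}, \{3, 4\}, and \{5, 6\}:

```python
from fractions import Fraction

# Illustrative check of the law of total probability on a fair die,
# with the partition {1, 2}, {3, 4}, {5, 6} and B = "roll is even".
omega = frozenset(range(1, 7))

def P(event):
    return Fraction(len(event), len(omega))

def P_given(A, B):
    return P(A & B) / P(B)

partition = [frozenset({1, 2}), frozenset({3, 4}), frozenset({5, 6})]
B = frozenset({2, 4, 6})

assert P(B) == sum(P(B & A) for A in partition)
assert P(B) == sum(P_given(B, A) * P(A) for A in partition)
```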

Independence

Definition 18.3 (Independence) Two events A and B are said to be independent if any of the following equivalent conditions holds, where the last two require \mathbb{P} (B) > 0 and \mathbb{P} (A) > 0 respectively.

  • \mathbb{P} (A \cap B) = \mathbb{P} (A) \mathbb{P} (B).

  • \mathbb{P} (A \mid B) = \mathbb{P} (A).

  • \mathbb{P} (B \mid A) = \mathbb{P} (B).
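For example (illustrative, not from the text), on a fair die the events A = \{2, 4, 6\} and B = \{1, 2\} are independent; the sketch below checks the first two conditions of Definition 18.3:

```python
from fractions import Fraction

# Illustrative check of Definition 18.3 on a fair die:
# A = "roll is even" and B = "roll is at most 2" are independent.
omega = frozenset(range(1, 7))

def P(event):
    return Fraction(len(event), len(omega))

A = frozenset({2, 4, 6})
B = frozenset({1, 2})

assert P(A & B) == P(A) * P(B)  # 1/6 == 1/2 * 1/3
assert P(A & B) / P(B) == P(A)  # P(A | B) == P(A)
```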

Definition 18.4 (Conditional independence) Two events A and B are said to be conditionally independent given an event C with \mathbb{P} (C) > 0 if any of the following holds.

  • \mathbb{P} (A \cap B \mid C) = \mathbb{P} (A \mid C) \mathbb{P} (B \mid C).

  • \mathbb{P} (A \mid B \cap C) = \mathbb{P} (A \mid C).

  • \mathbb{P} (B \mid A \cap C) = \mathbb{P} (B \mid C).

These independence properties allow us to calculate the probability of the intersection of independent events as the product of their individual probabilities.
